2  Data

2.1 Read Data and Description

For this project, we used the FoodData Central API provided by the United States Department of Agriculture (USDA) to access detailed nutritional data. The API is designed to help developers incorporate nutrient information into their applications, and it comes with comprehensive documentation of the database structure and data elements. Using this API, we collected data on 10,000 branded food items, capturing 24 features such as brand name, food category, package weight, and various nutrients and vitamins. The data were exported to a CSV file and imported into RStudio for analysis. Although the dataset is rich in information, some fields contain missing values, particularly vitamins and added sugars, because not all foods have complete nutritional details. Despite this, the data offer valuable insight into nutrient composition and allow us to identify gaps in dietary information. The data source is well documented and regularly updated by the USDA, which supports its reliability and accessibility for research purposes.
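The collection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for this project: the search endpoint URL, its query parameters (`api_key`, `query`, `dataType`, `pageSize`, `pageNumber`), and the response field names (`foods`, `foodNutrients`, `nutrientName`, `unitName`, `value`) are assumed from the public FoodData Central API, and the helper names `flatten_food()` and `fetch_branded_page()` are hypothetical.

```r
# Sketch only: endpoint, parameters, and field names are assumptions about the
# public FoodData Central API, not the project's actual collection script.
fdc_search_url <- "https://api.nal.usda.gov/fdc/v1/foods/search"

# Return a field as a character scalar, or NA when the field is absent.
chr_or_na <- function(x) if (is.null(x)) NA_character_ else as.character(x)

# Flatten one food record (a nested list) into a one-row data frame.
flatten_food <- function(food) {
  row <- data.frame(
    fdcId         = if (is.null(food$fdcId)) NA_real_ else as.numeric(food$fdcId),
    description   = chr_or_na(food$description),
    foodCategory  = chr_or_na(food$foodCategory),
    brandOwner    = chr_or_na(food$brandOwner),
    brandName     = chr_or_na(food$brandName),
    packageWeight = chr_or_na(food$packageWeight),
    publishedDate = chr_or_na(food$publishedDate),
    stringsAsFactors = FALSE
  )
  # One column per nutrient, named "<nutrient name> <unit>", e.g. "Protein G".
  for (n in food$foodNutrients) {
    if (is.null(n$nutrientName) || is.null(n$unitName)) next
    row[[paste(n$nutrientName, n$unitName)]] <-
      if (is.null(n$value)) NA_real_ else as.numeric(n$value)
  }
  row
}

# Fetch one page of branded foods for a search term.
# Requires the httr, jsonlite, and dplyr packages to be installed.
fetch_branded_page <- function(api_key, query, page = 1, page_size = 200) {
  resp <- httr::GET(
    fdc_search_url,
    query = list(api_key = api_key, query = query, dataType = "Branded",
                 pageSize = page_size, pageNumber = page)
  )
  httr::stop_for_status(resp)
  body <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8"),
    simplifyVector = FALSE
  )
  # bind_rows() fills nutrients missing from a given food with NA.
  dplyr::bind_rows(lapply(body$foods, flatten_food))
}
```

Under these assumptions, repeated calls such as `fetch_branded_page(my_key, "cheese", page = 2)` (with a hypothetical key and search term) could be combined with `dplyr::bind_rows()` and written out with `readr::write_csv()`. Foods that lack a nutrient simply end up with `NA` in that column, which is consistent with the missing values discussed in Section 2.2.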

2.1.1 Load required libraries

Code
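library(readr)      # read_csv() for importing the CSV file
library(tidyverse)  # attaches dplyr, ggplot2, tidyr, and the other core packages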
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2.1.2 Data Overview

The dataset was downloaded from xxxxxxxxxxxxxxxxxxxx and saved as “Branded Nutrients_Dataset.csv”.

Code
data <- read_csv("Branded Nutrients_Dataset.csv",
                 show_col_types = FALSE)

The dataset contains 10,000 rows, each representing one product, and 24 columns, which are listed below.

  • “fdcId”

  • “description”

  • “foodCategory”

  • “brandOwner”

  • “brandName”

  • “packageWeight”

  • “publishedDate”

  • “Protein G”

  • “Total lipid (fat) G” 

  • “Carbohydrate, by difference G”

  • “Energy KCAL”

  • “Total Sugars G”

  • “Fiber, total dietary G”

  • “Calcium, Ca MG”

  • “Iron, Fe MG”

  • “Sodium, Na MG”

  • “Vitamin C, total ascorbic acid MG”

  • “Cholesterol MG”

  • “Fatty acids, total trans G”

  • “Fatty acids, total saturated G”

  • “Vitamin A UG”

  • “Potassium, K MG”

  • “Vitamin D (D2 + D3) UG”

  • “Sugars, added G”

Display the first few rows

Code
head(data)
# A tibble: 6 × 24
    fdcId description            foodCategory brandOwner brandName packageWeight
    <dbl> <chr>                  <chr>        <chr>      <chr>     <chr>        
1 2617100 ALL NATURAL GLUTEN FR… Frozen Poul… Golden Pl… GOLDEN P… 24 oz        
2 2604504 ALL NATURAL ROSEMARY … Flavored Ri… SLT Foods… HERITAGE… 6.5 oz       
3 2570670 ARTISANAL COLLECTION … Pasta by Sh… Barilla G… BARILLA   1 lb         
4 2607109 AUTHENTIC BARREL RIPE… Cheese       Wdh, LLC   APHRODITE 6 oz         
5 1861692 BERRY NUT BLEND BREAK… Popcorn, Pe… Snyder's-… EMERALD   212.5 g/7.5 …
6 2624305 BLACK CHERRY PURE ENE… Soda         Energy Be… TRUE NOR… 12 fl oz     
# ℹ 18 more variables: publishedDate <chr>, `Protein G` <dbl>,
#   `Total lipid (fat) G` <dbl>, `Carbohydrate, by difference G` <dbl>,
#   `Energy KCAL` <dbl>, `Total Sugars G` <dbl>,
#   `Fiber, total dietary G` <dbl>, `Calcium, Ca MG` <dbl>,
#   `Iron, Fe MG` <dbl>, `Sodium, Na MG` <dbl>,
#   `Vitamin C, total ascorbic acid MG` <dbl>, `Cholesterol MG` <dbl>,
#   `Fatty acids, total trans G` <dbl>, …

Check column names

Code
colnames(data)
 [1] "fdcId"                             "description"                      
 [3] "foodCategory"                      "brandOwner"                       
 [5] "brandName"                         "packageWeight"                    
 [7] "publishedDate"                     "Protein G"                        
 [9] "Total lipid (fat) G"               "Carbohydrate, by difference G"    
[11] "Energy KCAL"                       "Total Sugars G"                   
[13] "Fiber, total dietary G"            "Calcium, Ca MG"                   
[15] "Iron, Fe MG"                       "Sodium, Na MG"                    
[17] "Vitamin C, total ascorbic acid MG" "Cholesterol MG"                   
[19] "Fatty acids, total trans G"        "Fatty acids, total saturated G"   
[21] "Vitamin A UG"                      "Potassium, K MG"                  
[23] "Vitamin D (D2 + D3) UG"            "Sugars, added G"                  
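Most of the nutrient column names contain spaces, commas, or parentheses, so they are non-syntactic in R and have to be wrapped in backticks when used as symbols. The following minimal sketch shows this on a toy data frame; the values are made up, and the replacement names `protein_g` and `energy_kcal` are hypothetical choices rather than names used elsewhere in this report.

```r
library(dplyr)

# Toy rows that reuse two of the dataset's column names; the values are made up.
toy <- data.frame(
  `Protein G`   = c(10, 3.5, NA),
  `Energy KCAL` = c(250, 120, 90),
  check.names = FALSE  # keep the original names instead of Protein.G etc.
)

# Backticks are required wherever a non-syntactic name is used as a symbol.
# Rows where the comparison is NA are dropped by filter().
high_protein <- toy |> filter(`Protein G` > 5)

# Optional: rename to syntactic names once, then work without backticks.
toy_renamed <- toy |>
  rename(protein_g = `Protein G`, energy_kcal = `Energy KCAL`)
```

Keeping the original names preserves the link to the USDA nutrient labels, while renaming makes later code shorter; either approach works as long as it is applied consistently.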

2.2 Missing value analysis

Every column except fdcId and description contains missing values, to varying degrees. We therefore examine the percentage of missing values in each column.

Code
# Count and percentage of missing values for every column
missing_count <- colSums(is.na(data))
missing_summary <- data.frame(
  Column = names(missing_count),
  MissingCount = missing_count,
  MissingPercent = missing_count / nrow(data) * 100
) |> 
  arrange(desc(MissingPercent))

# Bar chart of the missing percentage, ordered from most to least missing
ggplot(missing_summary, aes(x = reorder(Column, -MissingPercent), y = MissingPercent)) +
  geom_col(fill = "steelblue") +  # same result as geom_bar(stat = "identity")
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 100, by = 10), limits = c(0, 100)) +  # y-axis ticks every 10%
  labs(
    title = "Percentage of Missing Values per Column",
    x = "Column",
    y = "Percentage of Missing Values"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows that the main nutrients, such as Protein G, Energy KCAL, and Total lipid (fat) G, are nearly complete, each with less than 1% missing values. The brandOwner column is also mostly complete, with 2.27% missing values.

Nutrients that are not always displayed on packaging, including the vitamins and Sugars, added G, have a high proportion of missing values. In this project, we therefore focus mainly on the main nutrients and address the supplementary nutrients only when necessary.
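One way to make this choice explicit is to keep only the columns whose missing percentage falls below a cutoff. The sketch below illustrates the idea on a toy data frame with made-up values; the 30% cutoff and the names `max_missing_percent` and `main_columns` are assumptions for illustration, not settings used in this project.

```r
# Toy data with made-up values; only the column names come from the dataset.
toy <- data.frame(
  `Protein G`       = c(10, 3.5, 8, 1),
  `Energy KCAL`     = c(250, NA, 90, 40),
  `Sugars, added G` = c(NA, NA, NA, 2),
  check.names = FALSE
)

# colMeans() of a logical matrix gives the share of TRUE values per column,
# so this is the percentage of missing values in each column.
missing_percent <- colMeans(is.na(toy)) * 100

# Hypothetical cutoff: keep columns with at most 30% missing values.
max_missing_percent <- 30
main_columns <- names(missing_percent)[missing_percent <= max_missing_percent]

# Subset to the well-populated columns; drop = FALSE keeps a data frame
# even if only one column passes the cutoff.
toy_main <- toy[, main_columns, drop = FALSE]
```

In this toy example, Protein G and Energy KCAL pass the cutoff, while Sugars, added G is set aside. The same pattern could be applied to the full dataset with whatever cutoff suits the analysis.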